Using Partial Dependence Profiles and the Profile Disparity Index to Detect and Address Data Drift in Machine Learning Models.
Department of Statistics, Eskisehir Technical University
Machine learning (ML) is increasingly being integrated into diverse fields, from healthcare to finance. These applications often involve dynamic production environments where data continuously evolve. This phenomenon, known as data drift, can undermine the performance of ML models, leading them to produce misleading results. Data drift is commonly classified into virtual concept drift and true concept drift (Celik et al., 2022). Virtual concept drift occurs when the distribution of the input data changes over time while the relationship between predictors and the response remains unchanged. True concept drift occurs when the actual relationship between the predictors and the response changes. EXplainable Artificial Intelligence (XAI) tools, as showcased by Biecek & Burzykowski (2021), have emerged as promising solutions for detecting such changes. Specifically, the Partial Dependence Profile (PDP) elucidates how predictors influence model predictions, and the Profile Disparity Index (PDI) allows these profiles to be compared over time. Together, these tools emphasize the value of understanding and managing data drift in ML models.
Data drift refers to the situation in which a model that captures the relationship between \(X\) and \(y\) well at time \(t_0\) fails to adequately explain that relationship at a later time \(t_i\), \(i>0\), because the relationship has changed.
\[\begin{equation} \exists X: P_{t_0}(y \mid X) \neq P_{t_i}(y \mid X), i>0 \end{equation}\]
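The distinction between the two drift types can be illustrated with a small simulation. The sketch below is hypothetical (all names and the logistic data-generating process are assumptions, not the paper's exact setup): virtual drift shifts the distribution of \(X_1\) while keeping \(P(y \mid X)\) fixed, whereas true drift changes the coefficient linking \(X_1\) to \(y\).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_segment(n, mean=0.0, true_drift=False):
    """Generate one time segment of a binary classification task.

    Virtual drift: shift the mean of X1 (P(X) changes, P(y|X) does not).
    True drift: flip the sign of the X1 coefficient (P(y|X) changes).
    """
    X1 = rng.normal(mean, 1.0, n)
    X2 = rng.normal(0.0, 1.0, n)
    coef = -2.0 if true_drift else 2.0          # true drift alters P(y|X)
    p = 1.0 / (1.0 + np.exp(-(coef * X1 + X2)))  # logistic link (assumption)
    y = rng.binomial(1, p)
    return np.column_stack([X1, X2]), y

X0, y0 = make_segment(500)                   # baseline segment at t0
Xv, yv = make_segment(500, mean=1.5)         # virtual drift: P(X) shifts
Xt, yt = make_segment(500, true_drift=True)  # true drift: P(y|X) flips
```

A drift detector watching only \(P(X)\) would flag the second segment but miss the third, which is why model-based tools such as PDPs are useful.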
Figure 1: Drift Types
The partial dependence profile (PDP) calculates the impact of the variables used in the model on the predicted values, based on their marginal distributions (Biecek & Burzykowski, 2021).
In this context, \(f()\) represents the trained model, and \(\underline{x}_i^{j \mid = z}\) denotes the \(i\)-th observation with the \(j\)-th variable set to the value \(z\). \[ \widehat{g_{P D}^j}(z)=\frac{1}{n} \sum_{i=1}^n f\left(\underline{x}_i^{j \mid=z}\right) \]
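The formula above is a simple empirical average and can be computed directly. The following minimal sketch (function and variable names are illustrative, not from the paper) replaces column \(j\) with each grid value \(z\) and averages the model's predictions; a linear toy model makes the result easy to verify.

```python
import numpy as np

def partial_dependence(model_predict, X, j, grid):
    """Empirical PDP: mean prediction with column j forced to each grid value z."""
    profile = []
    for z in grid:
        Xz = X.copy()
        Xz[:, j] = z                      # x_i^{j|=z}: set variable j to z for every row
        profile.append(model_predict(Xz).mean())
    return np.array(profile)

# Toy check with a known model f(x) = 3*x0 + x1:
# the profile for variable 0 must be linear in z with slope 3.
X = np.random.default_rng(1).normal(size=(200, 2))
grid = np.linspace(-2.0, 2.0, 5)
pdp = partial_dependence(lambda A: 3 * A[:, 0] + A[:, 1], X, 0, grid)
```

Because the toy model is linear, the finite differences of the profile equal exactly \(3 \cdot \Delta z\), independent of the data.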
The profile disparity index (PDI) quantifies the dissimilarity between two PDPs based on their shapes (Kobylińska et al., 2023).
\[\begin{equation} \widehat{PDI}(\widehat{g_{f_1}^j}, \widehat{g_{f_2}^j}) = \frac{1}{m} \sum_{i=1}^{m} I(\text{sgn}(\text{der}(\widehat{g_{f_1}^j})[i]) \neq \text{sgn}(\text{der}(\widehat{g_{f_2}^j})[i])) \end{equation}\]
Here, \(m\) is the number of consecutive points on the profile, and \(\operatorname{der}(\widehat{g_{f_k}^j})[i]\) denotes the \(i^{th}\) element of the derivative vector of the profile for model \(f_k\) and predictor \(j\). The PDI ranges over \([0,1]\): a value of zero indicates identically shaped curves, while a value of one signifies distinctly different curves.
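Approximating the derivative with finite differences, the PDI reduces to counting sign disagreements between consecutive profile increments. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def pdi(profile_a, profile_b):
    """Profile Disparity Index: fraction of grid intervals on which the signs
    of the finite-difference derivatives of the two profiles disagree."""
    da = np.sign(np.diff(profile_a))
    db = np.sign(np.diff(profile_b))
    return np.mean(da != db)

# Identically shaped curves give PDI = 0; mirrored curves give PDI = 1.
z = np.linspace(0, 2 * np.pi, 50)
same = pdi(np.sin(z), np.sin(z))      # 0.0
mirrored = pdi(np.sin(z), -np.sin(z)) # 1.0
```

Note that the index compares only the direction of change at each point, so two profiles with the same shape but different vertical offsets still yield a PDI of zero.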
For a classification problem, datasets exhibiting different types of drift were generated by simulation and segmented using various window sizes. A random forest model was fitted on the first segment, and PDPs for the subsequent segments were derived from this model. The disparities between these PDPs were quantified using PDI values and visualized in Figure 2.
Figure 2: Drift Types
When the differences between PDPs were examined across drift types using the PDI metric, the PDI values increased as the mean of the variable \(X_1\) changed. This indicates that shifts in the mean of \(X_1\) have a substantial impact on the disparity between PDPs, highlighting the sensitivity of the PDI in capturing such variations.
Celik, B., Singh, P., & Vanschoren, J. (2022). Online AutoML: An adaptive AutoML framework for online learning. Machine Learning, 1–25.
Kobylińska, K., Krzyziński, M., Machowicz, R., Adamek, M., & Biecek, P. (2023). Exploration of Rashomon Set Assists Explanations for Medical Data. arXiv preprint arXiv:2308.11446.
Biecek, P., & Burzykowski, T. (2021). Explanatory Model Analysis. Chapman and Hall/CRC Press. ISBN: 9780367135591.
The work on this paper is financially supported by the Scientific and Technological Research Council of Türkiye under the 2210C program, grant no. 1649B022303919. Additional support was provided by the Eskisehir Technical University Scientific Research Projects Commission under grant no. 22LÖT175.